Method of determining base sequence of nucleic acid

ABSTRACT

To enable accurate analysis of a base sequence even in an electrophoretic pattern containing a degraded part. The base sequence of a nucleic acid is determined by conducting the following steps (A) to (C) in this order: (A) a basic peak extracting step wherein basic peaks are extracted from electrophoretic data involving the respective peaks of the four bases obtained by electrophoresing a sample nucleic acid; (B) a condition determining step wherein a basic peak at the search starting point, from which the search is started, and a standard peak-to-peak distance are determined based on the time-series data composed of the basic peaks extracted above; and (C) a base sequence determining step wherein peak-to-peak intervals are successively scanned forward and backward in the above-described time-series data starting from the basic peak at the search starting point and then the peak-to-peak distance is compared with the standard peak-to-peak distance as determined above so as to add an interpolation peak to a peak-missing area.

TECHNICAL FIELD

The present invention relates to a method for determining the base sequence of a nucleic acid such as DNA (deoxyribonucleic acid) from data obtained by measurement using a genetic analyzer utilizing electrophoresis.

BACKGROUND ART

When a genetic analyzer is used to determine the base sequence of DNA, four time-series data sets corresponding to four kinds of bases, A (adenine), G (guanine), C (cytosine), and T (thymine), are obtained.

According to a conventional base sequence determination method such as a method using Phred, prior to peak detection, preprocessing including noise elimination and mobility correction is performed on these time-series data sets. Mobility correction is performed to correct the deviation among the time-series data sets caused by a difference in migration speed among dyes used to label four kinds of bases. Then, peak detection is performed to determine peaks with a large peak height or a large peak area as the peaks of bases (see Non-Patent Document 1).

In measurement using electrophoresis, signal characteristics change widely among the initial, middle, and final stages of migration. Based on this fact, one of the present inventors has proposed a noise elimination method in which electrophoretic data is separated into several parts and noise elimination is performed on each of the parts in consideration of such a change in signal characteristics (see Patent Document 1).

Further, when a genetic analyzer is used to detect respective peaks of four kinds of bases in chronological order, there is a case where peak appearance time relatively deviates due to, for example, a difference in migration speed among dyes used to label four kinds of bases. In order to correct the deviation of peak appearance time, one of the present inventors has also proposed a method in which time-series signals are shifted with respect to one another so that the total area of portions where detected peaks overlap each other is minimized (see Patent Document 2).

-   Non-Patent Document 1: Ewing B, Hillier L, Wendl M, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8:175-185, 1998
-   Patent Document 1: Japanese Patent Application Laid-open No. 2002-202286
-   Patent Document 2: Japanese Patent Application Laid-open No. 2002-228633

DISCLOSURE OF THE INVENTION

Problem to be Solved by the Invention

In electrophoretic data measured in the final stage of migration, there is a part where it is difficult to detect a peak because its peak shape is unclear due to peak broadening or a small peak is hidden in the foot of a broad peak. Therefore, electrophoretic data measured in the final stage of migration is more likely to lack peak information that is a clue to determining a base sequence, and therefore cannot be accurately analyzed by a conventional base sequence determination method, such as a method using Phred, which depends on only electrophoretic data based on peaks detected by a peak detection operation to determine a base sequence.

In general, in the middle stage of migration, peaks with clear shapes are stably observed at regular intervals, but there is a case where electrophoretic data is partially degraded for a reason resulting from, for example, the reaction state of a measurement object. When data degradation occurs in electrophoretic data, the electrophoretic data cannot be accurately analyzed by a conventional processing method, such as a method using Phred, based on the premise that electrophoretic data to be analyzed is stable, even when it is electrophoretic data measured in the middle stage of migration.

It is an object of the present invention to provide a method capable of accurately determining a base sequence even from electrophoretic data containing a degraded part.

Means for Solving the Problem

The present invention is directed to a method for determining the base sequence of a nucleic acid, including the following steps (A) to (C) in this order: (A) a basic peak extracting step in which basic peaks are extracted from electrophoretic data containing respective peaks of four kinds of bases obtained by electrophoresing a sample nucleic acid; (B) a condition determining step in which a basic peak at a search starting point from which a search is started and a standard peak-to-peak distance are determined based on time-series data composed of the extracted basic peaks; and (C) a base sequence determining step in which a search is started from the basic peak at the search starting point to sequentially scan intervals between adjacent basic peaks in temporal forward and backward directions in the time-series data and the distance of each interval between adjacent basic peaks is compared with the standard peak-to-peak distance to add an interpolation peak to a peak-missing area to determine a base sequence of the sample nucleic acid.

The basic peak refers to a clear peak of a base. In the basic peak extracting step, basic peaks are extracted, for example, in the following manner. The peak height and/or the peak area of a peak that can be obviously regarded as a peak of a base are/is set as a threshold value(s), and then the peak height and/or the peak area of each peak contained in total time-series data (i.e., electrophoretic data) are/is compared with the threshold value(s) to extract peaks having a peak height or a peak area larger than the threshold value as basic peaks.

Then, after the completion of the condition determining step, the base sequence determining step is performed to add an interpolation peak to a peak-missing area. More specifically, the distance of each interval between adjacent basic peaks is compared with the standard peak-to-peak distance, and then a required number of interpolation peaks are added to a peak-missing area so that a peak-to-peak distance comes close to the standard peak-to-peak distance.

The kind of base of an interpolation peak to be added is determined based on electrophoretic data located at a peak position where the interpolation peak is to be added. A peak located at a position where an interpolation peak is to be added is a signal whose level is too low to be detected as a basic peak.

By providing the step of adding an interpolation peak, it is possible to reproduce, as an interpolation peak, a peak that appears in the final stage of migration but is difficult to detect due to its unclear peak shape or a small peak hidden in the foot of an adjacent broad peak due to its low peak height or small peak area. Further, it is also possible to obtain accurate base sequence information even when peak shape degradation occurs for any reason in part of electrophoretic data, which includes a case where data degradation occurs in the middle stage of migration.

It is preferred that the method for determining a base sequence according to claim 1 further includes the following steps (a) to (c) in this order between the steps (A) and (B): (a) an area dividing step in which time-series data composed of the extracted basic peaks is divided into small areas; (b) a best area extracting step in which the best area showing the best arrangement of basic peaks is extracted from the divided areas; and (c) a mobility correction step in which a mobility correction amount is calculated from the extracted best area based on a difference in mobility among four kinds of bases, and then mobility correction is performed on the total time-series data based on the mobility correction amount, wherein the time-series data having been subjected to mobility correction in the step (c) is used as time-series data in the step (B).

The size of each small area divided in the area dividing step (a) is determined in advance so that the distance between adjacent peaks can be regarded as substantially constant in each small area. If the size of each small area is too large, the distance between adjacent peaks varies in each small area, and therefore it is difficult to differentiate between a good area and a bad area. On the other hand, if the size of each small area is too small, the number of peaks contained in each small area is small so that the amount of information usable for a discrimination operation becomes small. For this reason, the size of each small area is preferably determined so that each small area contains, for example, about 100 to 300 bp.

In the best area extracting step (b), the best area is extracted from small areas obtained by dividing the time-series data in the area dividing step. At this time, evaluation of the arrangement of basic peaks is performed on each small area based on the characteristic of electrophoretic data that, when migration is more successfully performed, clear peaks of bases that can be extracted as basic peaks appear more evenly at more regular intervals in the data. Then, a small area showing the best arrangement of basic peaks is extracted as the best area.

The best area extracting step (b) can be performed by, for example, performing the best mobility correction on each small area based on a difference in mobility among four kinds of bases and then extracting a small area having the largest total peak area as the best area. The best mobility correction can be performed, for example, in the following manner. Four kinds of detection data sets corresponding to four kinds of bases are overlaid together in each small area, and then while one to three kinds of detection data sets out of the four kinds of detection data sets are fixed, the remaining detection data set(s) is(are) shifted in temporal forward and backward directions so that the total area of a peak waveform is maximized.

The best area extracting step (b) can be performed also by, for example, measuring the distances between adjacent basic peaks in each small area and then extracting a small area having the smallest dispersion in distances between adjacent basic peaks as the best area. The dispersion in distances between adjacent basic peaks can be evaluated by, for example, the ratio between a minimum distance between adjacent basic peaks (Dmin) and a maximum distance between adjacent basic peaks (Dmax) in each small area. In this case, a larger ratio of Dmin/Dmax indicates a smaller dispersion in distances between adjacent basic peaks. When it is judged that a small area contains only a repeated sequence, the evaluation of the small area is preferably downgraded by multiplying its ratio of Dmin/Dmax by a value smaller than 1.

In the area dividing step (a), adjacent small areas may be overlapped with each other to increase the number of areas.

In the mobility correction step (c), four kinds of basic peaks corresponding to four kinds of bases are overlaid together in the best area, and then while one to three kinds of basic peaks out of the four kinds of basic peaks are fixed, the remaining basic peaks are shifted in temporal forward and backward directions so that the total area of a peak waveform is maximized to determine a shift amount(s) of one to three kinds of bases. The shift amount(s) of one to three kinds of bases whereby the total area of a peak waveform is maximized is(are) the mobility correction amount(s) of the one to three kinds of bases.

In a case where a mobility correction amount is calculated in the best area extracting step (b), the mobility correction amount may be used in the mobility correction step (c).

The basic peak at search starting point determined in the condition determining step (B) may be a basic peak having the largest peak area in the best area.

The standard peak-to-peak distance determined in the condition determining step (B) may be an average distance between adjacent basic peaks calculated from a part of the best area where basic peaks are spaced at regular intervals.

Effects of the Invention

According to the present invention, basic peaks are first extracted and then an interpolation peak is added to a peak-missing area between adjacent basic peaks, and therefore it is possible to secure reliable information and then pick up less reliable information using the reliable information. Therefore, it is possible to keep a high degree of accuracy in reading base sequence information even in a case where data is partially degraded and therefore peak detection cannot be performed due to signal waveform degradation or a peak is so small that it is hidden in the foot of an adjacent peak, such as a case where data is partially degraded for a reason resulting from, for example, the reaction state of a measurement object.

Further, by performing mobility correction using the mobility correction means based on a mobility correction amount calculated from the best area, it is possible to reduce the occurrence of overtaking or overlapping between the peaks of bases caused by a difference in migration speed among four kinds of bases. Further, by providing the area dividing step, the best area extracting step, and the mobility correction step, it is possible to accurately detect a peak-missing area in the base sequence determining step and properly add an interpolation peak to the peak-missing area, thereby improving the degree of accuracy in determining a base sequence. This makes it possible to more accurately read the information of a longer base sequence from data measured by a genetic analyzer.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing functions for carrying out a method according to one embodiment of the present invention.

FIG. 2 is a flow chart of the method according to one embodiment of the present invention.

FIG. 3 is a graph showing electrophoretic data of four kinds of bases.

FIG. 4 is a graph showing electrophoretic data obtained by performing mobility correction.

FIG. 5 is a graph showing electrophoretic data obtained by further adding interpolation peaks to the electrophoretic data shown in FIG. 4.

FIG. 6A is a graph showing electrophoretic data in which basic peaks are spaced at substantially regular intervals.

FIG. 6B is a graph showing electrophoretic data in which basic peaks are spaced at irregular intervals.

DESCRIPTION OF THE REFERENCE NUMERALS

-   10 computer
-   12 basic peak extracting means
-   14 best area extracting means
-   16 mobility correction means
-   20 base sequence determining means

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 is a block diagram showing functions for performing processing steps included in a method according to one embodiment of the present invention. A basic peak extracting means 12 is a function for performing a basic peak extracting step. A best area extracting means 14 is a function for performing an area dividing step and a best area extracting step. A mobility correction means 16 is a function for performing a mobility correction step, and the mobility correction step is performed on total time-series data. A base sequence determining means 20 performs a condition determining step and then a base sequence determining step on the total time-series data incorporated by the mobility correction means 16. In the base sequence determining step, an interpolation peak is added to a peak-missing area.

The basic peak extracting means 12, the best area extracting means 14, the mobility correction means 16, and the base sequence determining means 20 are functions that are achieved by a computer 10. The computer 10 is a computer dedicated to an electrophoresis apparatus or a general-purpose personal computer.

Electrophoretic data is detected by an electrophoresis apparatus and is first stored in a memory device, and is then incorporated into the computer 10 and processed. A base sequence determined by the base sequence determining means 20 is outputted to a recorder or a display, and is then stored in a memory device or sent to another computer.

In a case where it is not necessary to perform mobility correction because a difference in migration speed among dyes used to label four kinds of bases, A, G, C, and T, is not so large, the mobility correction step performed by the mobility correction means 16 may be omitted. In this case, after the basic peak extracting step is performed by the basic peak extracting means 12, the base sequence determining step is performed by the base sequence determining means 20 without performing the mobility correction step.

FIG. 2 is a flow chart of the method according to one embodiment of the present invention. FIG. 3 shows examples of signal waveforms obtained as electrophoretic data by electrophoresis measurement performed using dyes for labeling four kinds of bases, A, G, C, and T. In FIG. 3, a horizontal axis represents time, and values plotted on the horizontal axis are the number of scans performed by a detection part of an electrophoresis apparatus using exciting light to detect fluorescence. The number of scans is proportional to time. In FIG. 3, a vertical axis represents the intensity of fluorescence detected at the respective fluorescence wavelengths of the labeling dyes.

The method according to the present embodiment processes electrophoretic data, such as one shown in FIG. 3, in which time-series data sets are temporally deviated with respect to one another due to a difference in migration speed among labeling dyes. Hereinbelow, operations for carrying out the method according to the present embodiment will be described with reference to FIG. 1, which is a block diagram showing functions for carrying out the method, and FIG. 2, which is a flow chart of the method.

(Basic Peak Extraction)

First, the basic peak extracting step is performed by the basic peak extracting means 12 on each peak contained in the respective time-series data sets of four kinds of bases shown in FIG. 3 to extract basic peaks, or clear peaks.

In order to extract basic peaks, it is also necessary to perform a peak detection operation. Such a peak detection operation can be performed by a generally used known method. Extraction of basic peaks is performed by, for example, sequentially searching for a point at which the gradient of a measured data signal value changes from positive to negative as a candidate for a peak top, and then extracting, as basic peaks, only peaks whose signal value at the peak top (i.e., peak height) and/or peak area are/is larger than their respective predetermined threshold values Ith.

In actual measured data, some peaks of bases have a small peak height or unclear peak shape. If a threshold value is set low, noise waveforms as well as such peaks of bases are extracted as basic peaks. Therefore, in this step, a threshold value is set high so that noise waveforms can be reliably eliminated, that is, only the peaks of bases can be reliably extracted as basic peaks.
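
As a non-limiting illustration, the gradient-based extraction described above may be sketched in Python as follows. The function name extract_basic_peaks, the parameter name threshold (standing for Ith), and the toy trace are assumptions introduced only for this sketch; peak area and flat-topped peaks are not handled.

```python
from typing import List, Tuple


def extract_basic_peaks(signal: List[float], threshold: float) -> List[Tuple[int, float]]:
    """Return (scan index, peak height) for every clear peak in one trace.

    A peak-top candidate is a point where the gradient changes from
    positive to negative; it is kept only if its height exceeds
    `threshold` (the value called Ith in the text).
    """
    peaks = []
    for i in range(1, len(signal) - 1):
        rising = signal[i] - signal[i - 1] > 0   # positive gradient before i
        falling = signal[i + 1] - signal[i] < 0  # negative gradient after i
        if rising and falling and signal[i] > threshold:
            peaks.append((i, signal[i]))
    return peaks


# Hypothetical trace for one base: two clear peaks and a small bump at index 5.
trace = [0, 1, 5, 1, 0, 2, 0, 1, 9, 3, 0]
basic_peaks = extract_basic_peaks(trace, threshold=4.0)
print(basic_peaks)  # [(2, 5), (8, 9)]
```

With the threshold set high, the small bump at index 5 is left out at this stage; it is the kind of low-level signal that the later interpolation step is meant to recover.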

(Best Area Extraction)

Then, the area dividing step and the best area extracting step are performed by the best area extracting means 14. In the area dividing step, time-series data composed of the extracted basic peaks is divided into small areas. The size of each small area is not particularly limited, but is set so that each small area contains 100 to 300 bp. An area shown in FIG. 3 is a part of one small area, that is, it is smaller than one small area.

The arrangement of basic peaks in each divided small area is evaluated to extract “the best area” showing the best arrangement of basic peaks. In this step, evaluation of arrangement of basic peaks is performed on each small area based on the characteristics of electrophoretic data that when migration is more successfully performed, basic peaks appear more evenly at more regular intervals in the data.

The evaluation of arrangement of basic peaks can be performed by, for example, a method based on the findings that the best mobility correction amount is a shift amount whereby the total area of a peak waveform is maximized (see Patent Document 2). According to this method, the evaluation value of each small area is a total peak area calculated after the best mobility correction is performed. Therefore, a small area having the largest total peak area is extracted as the best area. More specifically, four kinds of detection data sets corresponding to four kinds of bases are overlaid together in each small area, and then while one to three kinds of detection data sets out of the four kinds of detection data sets are fixed, the remaining detection data set(s) is(are) shifted in temporal forward and backward directions so that the total area of a peak waveform is maximized. In this way, the best mobility correction is performed on each small area. Then, a small area whose total area of a peak waveform calculated after the best mobility correction is performed is largest is selected as the best area.

The evaluation of each small area can be performed also by another method in which distances between adjacent basic peaks are measured in each small area and the dispersion in distances between adjacent basic peaks is evaluated to extract a small area having the smallest dispersion in distances between adjacent basic peaks as the best area. FIGS. 6(A) and 6(B) are graphs showing examples of small areas different in the dispersion in the distances between adjacent basic peaks. More specifically, FIG. 6(A) is a graph showing an example of a small area in which basic peaks are spaced at substantially regular intervals, and FIG. 6(B) is a graph showing an example of a small area in which basic peaks are spaced at irregular intervals because some peaks of bases have not been detected as basic peaks due to their unclear peak shape or low peak height. These two examples are greatly different in the dispersion in distances between adjacent basic peaks. The dispersion in distances between adjacent basic peaks is evaluated by, for example, utilizing the ratio between a minimum distance between adjacent basic peaks (Dmin) and a maximum distance between adjacent basic peaks (Dmax) in each small area. More specifically, when the value of Dmin/Dmax is larger, the dispersion in distances between adjacent basic peaks is smaller, and on the other hand, when the value of Dmin/Dmax is smaller, the dispersion in distances between adjacent basic peaks is larger. In FIGS. 6(A) and 6(B), the value of Dmin1/Dmax1 is larger than the value of Dmin2/Dmax2, and therefore it can be judged that the dispersion in distances between adjacent basic peaks is smaller in FIG. 6(A) than in FIG. 6(B).

By evaluating all the small areas k=1, 2, . . . N using the following formula (1) for determining an evaluation value in such a manner as described above to detect a small area kb having the largest evaluation value, it is possible to select a small area having the smallest dispersion in distances between adjacent basic peaks as “the best area”.

Evaluation(k) = Dmin(k)/Dmax(k)  (1)

In general, one of the difficult problems common to base sequence determination methods is that accurate mobility correction cannot be performed when an analysis object is a gene sequence called “repeated sequence” in which only a certain kind(s) of bases selected from four kinds of bases A, G, C, and T are continuously arranged. In the case of using the above-described evaluation method in which the dispersion in distances between adjacent basic peaks of each small area is determined to extract the best area, it is preferred that the evaluation of a small area containing only a “repeated sequence” is downgraded. Whether or not a small area contains only a “repeated sequence” can be determined by checking how many kinds of bases are contained in the small area. When it is judged that the small area contains only a “repeated sequence”, the evaluation of the small area is downgraded by, for example, multiplying the evaluation value of the small area obtained by the above formula (1) by a value smaller than 1, for example, 0.5, to select the best area from small areas other than the small area containing only a “repeated sequence”.
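
The evaluation by formula (1), together with the downgrading of a small area containing only a “repeated sequence”, may be sketched as follows. This is an illustrative sketch only: the names evaluate_area and select_best_area, the representation of a basic peak as a (position, base) pair, and the parameter max_repeat_kinds (the number of base kinds at or below which an area is treated as a repeated sequence) are assumptions not taken from the description, which leaves that criterion open.

```python
from typing import List, Tuple

Peak = Tuple[int, str]  # (scan position, base letter); hypothetical representation


def evaluate_area(peaks: List[Peak], penalty: float = 0.5,
                  max_repeat_kinds: int = 1) -> float:
    """Evaluation(k) = Dmin(k)/Dmax(k), downgraded for repeated sequences."""
    positions = sorted(pos for pos, _ in peaks)
    distances = [b - a for a, b in zip(positions, positions[1:])]
    if not distances or max(distances) == 0:
        return 0.0  # too few peaks to evaluate
    score = min(distances) / max(distances)
    kinds = {base for _, base in peaks}
    if len(kinds) <= max_repeat_kinds:  # assumed test for a repeated sequence
        score *= penalty                # multiply by a value smaller than 1
    return score


def select_best_area(areas: List[List[Peak]]) -> int:
    """Return the index kb of the small area with the largest evaluation value."""
    return max(range(len(areas)), key=lambda k: evaluate_area(areas[k]))


regular = [(10, "A"), (20, "C"), (30, "G"), (40, "T")]    # Dmin/Dmax = 1.0
irregular = [(10, "A"), (20, "C"), (50, "G"), (60, "T")]  # Dmin/Dmax = 10/30
repeat = [(10, "A"), (20, "A"), (30, "A"), (40, "A")]     # 1.0 * 0.5
print(select_best_area([irregular, repeat, regular]))     # 2
```

The repeated-sequence area is evenly spaced, so without the penalty it would tie with the regular area; the multiplication by 0.5 is what moves the selection to an area containing several kinds of bases.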

When the time-series data is divided into small areas, adjacent small areas may be overlapped with each other to increase the number of areas.
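
One possible way of dividing the data into overlapping small areas is a sliding window over the list of basic peaks, sketched below. The function name divide_into_areas and the parameters area_size and step are hypothetical, and measuring the area size as a number of basic peaks rather than in bp is a simplifying assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def divide_into_areas(peaks: Sequence[T], area_size: int, step: int) -> List[Sequence[T]]:
    """Split `peaks` into windows of `area_size` items, advancing by `step`.

    A `step` smaller than `area_size` makes adjacent areas overlap, which
    increases the number of areas available for evaluation.
    """
    if area_size <= 0 or step <= 0:
        raise ValueError("area_size and step must be positive")
    areas = []
    start = 0
    while start < len(peaks):
        areas.append(peaks[start:start + area_size])
        if start + area_size >= len(peaks):
            break  # the last window already reaches the end of the data
        start += step
    return areas


peaks = list(range(10))  # stand-ins for ten basic peaks
print(len(divide_into_areas(peaks, area_size=4, step=4)))  # 3 areas, no overlap
print(len(divide_into_areas(peaks, area_size=4, step=2)))  # 4 areas, 50% overlap
```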

In this way, a small area containing basic peaks of four kinds of bases and exhibiting a small dispersion in distances between adjacent basic peaks is extracted as the best area.

(Mobility Correction)

Then, a mobility correction amount is calculated by the mobility correction means 16 from “the best area” extracted in such a manner as described above. Mobility correction is performed, for example, in the following manner. A signal string composed of the basic peaks of, for example, a base G is fixed, and then a signal string composed of the basic peaks of a base A, a signal string composed of the basic peaks of a base C, and a signal string composed of the basic peaks of a base T are shifted along the temporal axis, wherein the shift amounts of these signal strings are defined as ΔA(i), ΔC(j), and ΔT(k), respectively.

ΔA(i) = 0, ±1, ±2, . . . (wherein i = 1, 2, 3, . . . )
ΔC(j) = 0, ±1, ±2, . . . (wherein j = 1, 2, 3, . . . )
ΔT(k) = 0, ±1, ±2, . . . (wherein k = 1, 2, 3, . . . )

The total area S (i, j, k) of basic peaks contained in the best area is determined for each combination of the shift amounts ΔA(i), ΔC(j), and ΔT(k). The shift amounts ΔA(ib), ΔC(jb), and ΔT(kb), the combination of which makes the total area S (i, j, k) maximum, are the mobility correction amounts of bases A, C, and T, respectively.
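
The search over shift combinations may be sketched as an exhaustive search, as shown below. Several points are assumptions made only for this sketch and are not stated in the description: the total area S is taken as the area under the pointwise maximum of the overlaid traces (so that overlapping peaks reduce it), shifts are limited to ±max_shift scans, samples shifted outside the area are discarded, and the first maximum encountered is kept when several combinations tie. The names shifted, envelope_area, find_mobility_correction, and impulses are hypothetical.

```python
from itertools import product
from typing import Dict, List


def shifted(trace: List[float], shift: int) -> List[float]:
    """Shift a trace by `shift` scans along the time axis (zero padded)."""
    out = [0.0] * len(trace)
    for i, value in enumerate(trace):
        j = i + shift
        if 0 <= j < len(trace):
            out[j] = value
    return out


def envelope_area(traces: List[List[float]]) -> float:
    """Assumed definition of S: area under the pointwise maximum of the overlay."""
    return sum(max(values) for values in zip(*traces))


def find_mobility_correction(traces: Dict[str, List[float]], fixed: str = "G",
                             max_shift: int = 2) -> Dict[str, int]:
    """Try every shift combination of the non-fixed bases and keep the best."""
    moving = [base for base in traces if base != fixed]
    best_area, best_shifts = float("-inf"), {}
    for combo in product(range(-max_shift, max_shift + 1), repeat=len(moving)):
        overlay = [traces[fixed]] + [shifted(traces[b], s) for b, s in zip(moving, combo)]
        area = envelope_area(overlay)
        if area > best_area:  # strict '>' keeps the first maximum found
            best_area, best_shifts = area, dict(zip(moving, combo))
    return best_shifts


def impulses(positions: List[int], length: int) -> List[float]:
    """Build a toy trace with unit-height peaks at the given scan positions."""
    trace = [0.0] * length
    for p in positions:
        trace[p] = 1.0
    return trace


# Toy best area: before correction, the A peaks coincide with the G peaks
# and the T peaks partly coincide with them as well.
traces = {
    "G": impulses([2, 6, 10], 16),
    "A": impulses([2, 6, 10], 16),
    "C": impulses([4, 8, 12], 16),
    "T": impulses([6, 10, 14], 16),
}
correction = find_mobility_correction(traces, fixed="G", max_shift=2)
corrected = [traces["G"]] + [shifted(traces[b], s) for b, s in correction.items()]
print(envelope_area(list(traces.values())), envelope_area(corrected))  # 7.0 12.0
```

In this perfectly periodic toy data more than one shift combination removes every overlap, so only the resulting area is reported, not the particular shifts. Measured data with an irregular base order would normally narrow the choice.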

It is to be noted that when a mobility correction amount is calculated to extract “the best area” in the best area extracting step, the mobility correction amount can be used in the mobility correction step.

The mobility correction amount(s) is(are) applied to the total data to perform mobility correction.

(Addition of Interpolation Peak)

FIG. 4 is a graph showing electrophoretic data obtained by performing mobility correction on signal strings shown in FIG. 3. As shown in FIG. 4, the deviation among the time-series data sets is corrected by mobility correction so that overlapping or overtaking between peaks is eliminated. As a result, all the peaks are spaced at substantially regular intervals. In FIG. 4, impulse signals indicate peak detection positions. However, the basic peaks do not include all the peaks of bases, and some peaks of bases indicated by dashed arrows in FIG. 4 are missed.

Therefore, the base sequence determining means 20 is operated to detect these missing peaks of bases and add interpolation peaks. Here, a basic peak at search starting point and a standard peak-to-peak distance are determined, and then a search is started from the basic peak at search starting point to sequentially scan intervals between adjacent basic peaks in temporal forward and backward directions and a required number of interpolation peaks are added to a peak-missing area.

The basic peak at search starting point is determined so as to satisfy the requirement that, for example, its peak area is largest in “the best area”.

The standard peak-to-peak distance is determined by, for example, calculating an average peak-to-peak distance from a part of the best area where basic peaks are spaced at regular intervals. Further, the standard peak-to-peak distance may be monotonically and gradually decreased during sequential scanning in temporal forward and backward directions based on the characteristics of electrophoresis that a peak-to-peak distance is gradually decreased from the middle stage to the initial stage of migration and from the middle stage to the final stage of migration. Alternatively, the standard peak-to-peak distance may be gradually changed from area to area. In this case, the average peak-to-peak distance of an area, on which searching of a peak-missing area and addition of an interpolation peak have already been performed, is used as the standard peak-to-peak distance of the next area.
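
The two conditions determined here may be sketched as follows. The description does not specify how the regularly spaced part of the best area is identified; in this sketch it is assumed, purely for illustration, to consist of the intervals lying within a relative tolerance of the median interval. The names choose_start_peak and standard_distance and the parameter tolerance are hypothetical.

```python
from statistics import mean, median
from typing import List, Tuple


def choose_start_peak(peaks: List[Tuple[int, float]]) -> int:
    """Return the index of the basic peak with the largest peak area.

    `peaks` holds (position, peak area) pairs from the best area.
    """
    return max(range(len(peaks)), key=lambda i: peaks[i][1])


def standard_distance(positions: List[int], tolerance: float = 0.2) -> float:
    """Average the intervals that look regular (assumed: close to the median)."""
    distances = [b - a for a, b in zip(positions, positions[1:])]
    if not distances:
        raise ValueError("at least two peaks are required")
    typical = median(distances)
    regular = [d for d in distances if abs(d - typical) <= tolerance * typical]
    return mean(regular)


best_area = [(0, 3.0), (10, 8.5), (20, 5.0)]             # (position, peak area)
print(choose_start_peak(best_area))                      # 1
print(standard_distance([0, 10, 20, 31, 60, 70]))        # 10.25
```

In the second call the interval of 29 scans, which probably hides missing peaks, is excluded from the average, so the standard distance reflects only the evenly spaced intervals.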

A method for searching a peak-missing area and adding an interpolation peak will be described below with reference to FIG. 4. The average peak-to-peak distance d of a part located on the right side of an area shown in FIG. 4 (i.e., a part on the side where the base pair number is larger) where peaks are extracted at substantially regular intervals is determined, and the average peak-to-peak distance d is defined as a standard peak-to-peak distance. The peak-to-peak distance of the next part is expected to be close to d, but is actually d′ as shown in FIG. 4. In this case, a new peak-to-peak distance NewD(n) can be expressed by the following formula (2) assuming that n interpolation peaks are added to this area.

NewD(n) = d′/(n+1)  (2)

wherein n = 0, 1, or 2

Then, n is determined so that the new peak-to-peak distance NewD(n) becomes closest to the standard peak-to-peak distance d, and the thus determined n is defined as the number of interpolation peaks that should be added. The added interpolation peaks are placed at positions spaced apart by the distance NewD(n).
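
The selection of n by formula (2) and the forward and backward scanning from the search starting point may be sketched together as follows. The function names, the use of a fixed standard distance throughout the scan, and the toy positions are assumptions of this sketch; the description also allows the standard distance to change during scanning.

```python
from typing import List


def count_interpolation_peaks(gap: float, standard: float, max_added: int = 2) -> int:
    """Choose n in 0..max_added so that NewD(n) = gap/(n+1) is closest to `standard`."""
    return min(range(max_added + 1), key=lambda n: abs(gap / (n + 1) - standard))


def interpolation_positions(a: float, b: float, n: int) -> List[float]:
    """Place n peaks between a and b, evenly spaced by NewD(n), ordered from a to b."""
    step = (b - a) / (n + 1)
    return [a + step * (i + 1) for i in range(n)]


def scan_and_interpolate(positions: List[float], start_index: int,
                         standard: float) -> List[float]:
    """Scan forward, then backward, from the starting peak and fill missing peaks."""
    forward: List[float] = []
    for i in range(start_index, len(positions) - 1):
        a, b = positions[i], positions[i + 1]
        n = count_interpolation_peaks(b - a, standard)
        forward += interpolation_positions(a, b, n) + [b]
    backward: List[float] = []
    for i in range(start_index, 0, -1):
        a, b = positions[i], positions[i - 1]  # b lies earlier in time than a
        n = count_interpolation_peaks(a - b, standard)
        backward += interpolation_positions(a, b, n) + [b]
    return backward[::-1] + [positions[start_index]] + forward


# Hypothetical basic peak positions with gaps of 20 and 30 scans; d = 10.
print(count_interpolation_peaks(21, 10))                        # 1
print(scan_and_interpolate([0, 10, 30, 40, 70], 3, 10))
```

For the gap of 21 scans, NewD(0) = 21, NewD(1) = 10.5, and NewD(2) = 7, so one interpolation peak brings the spacing closest to d = 10.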

The kind of base of an interpolation peak to be added is determined in the following manner. In the case of the above example, the assumed time centers at which interpolation peaks are to be added are first determined so that the interpolation peaks and the basic peaks located at both ends of the peak-to-peak interval having a distance d′ are spaced at regular intervals. Then, the threshold value is lowered around each assumed time center to detect a peak. When a peak is detected, the kind of base of the peak is determined as the kind of base of the interpolation peak to be added. On the other hand, when it is difficult to detect a peak due to its unclear peak shape, the kind of base of the time-series data having the highest signal value is selected as the kind of base of the interpolation peak to be added. As a result of addition of interpolation peaks, a peak sequence as shown in FIG. 5 is finally obtained.
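
The determination of the kind of base at an assumed time center may be sketched as follows. The function name call_interpolated_base, the search window half_window, the choice of the tallest peak when several low peaks are found, and the use of the signal value at the center itself for the fallback are all assumptions made for this sketch.

```python
from typing import Dict, List


def call_interpolated_base(traces: Dict[str, List[float]], center: int,
                           half_window: int, lowered_threshold: float) -> str:
    """Pick the base of an interpolation peak around the scan index `center`."""
    best = None  # (height, base) of the tallest low-level peak found so far
    for base, trace in traces.items():
        lo = max(1, center - half_window)
        hi = min(len(trace) - 2, center + half_window)
        for i in range(lo, hi + 1):
            is_top = trace[i - 1] < trace[i] > trace[i + 1]
            if is_top and trace[i] > lowered_threshold:
                if best is None or trace[i] > best[0]:
                    best = (trace[i], base)
    if best is not None:
        return best[1]  # a peak was detected with the lowered threshold
    # No detectable peak: fall back to the trace with the highest signal value.
    return max(traces, key=lambda base: traces[base][center])


# Case 1: a small A peak is found once the threshold is lowered.
small_peak = {"A": [0, 0, 1, 3, 1, 0, 0], "C": [0, 0, 0, 0.5, 0, 0, 0],
              "G": [0] * 7, "T": [0] * 7}
print(call_interpolated_base(small_peak, center=3, half_window=1, lowered_threshold=2))

# Case 2: no peak top exists (signal on the foot of a broad T peak).
shoulder = {"A": [0] * 7, "C": [0] * 7, "G": [0] * 7,
            "T": [5, 4, 3, 2, 1, 0, 0]}
print(call_interpolated_base(shoulder, center=3, half_window=1, lowered_threshold=0.5))
```

The first call returns "A" through the lowered-threshold detection, and the second returns "T" through the fallback, since the monotonically falling T signal has no peak top inside the window.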

(Base Sequence Determination)

A base sequence can be determined by sequentially reading a peak sequence such as one shown in FIG. 5.

According to the present embodiment, since the best area is extracted by processing based on only basic peaks believed to give reliable information in the early stage of the method, and then mobility correction is performed based on a mobility correction amount calculated from the best area, disturbance caused by less reliable information is less likely to occur even when data having a degraded part is processed. Therefore, the method according to the present embodiment is less likely to be influenced by partial change in electrophoretic data such as data quality degradation, and therefore can determine a base sequence with stability.

INDUSTRIAL APPLICABILITY

The present invention can be applied to determination of a base sequence from measured data obtained by electrophoresing a nucleic acid such as DNA or RNA.

1. A method for determining a base sequence of a nucleic acid, comprising the following steps (A) to (C) in this order:
(A) a basic peak extracting step in which basic peaks are extracted from electrophoretic data containing respective peaks of four kinds of bases obtained by electrophoresing a sample nucleic acid, the basic peaks being peaks having a peak height or a peak area larger than a threshold value;
(B) a condition determining step in which a basic peak at a search starting point from which a search is started and a standard peak-to-peak distance are determined based on time-series data composed of the extracted basic peaks; and
(C) a base sequence determining step in which a search is started from the basic peak at a search starting point to sequentially scan intervals between adjacent basic peaks in temporal forward and backward directions in the time-series data, and then a distance of each interval between adjacent basic peaks is compared with the standard peak-to-peak distance to add an interpolation peak to a peak-missing area to determine a base sequence of the sample nucleic acid,
wherein the method further comprises the following steps (a) to (c) in this order between the steps (A) and (B):
(a) an area dividing step in which time-series data composed of the extracted basic peaks is divided into small areas;
(b) a best area extracting step in which a best area showing a best arrangement of basic peaks is extracted from the divided areas; and
(c) a mobility correcting step in which a mobility correction amount is calculated from the extracted best area based on a difference in mobility among four kinds of bases, and then mobility correction is performed on the total time-series data based on the mobility correction amount,
wherein the time-series data having been subjected to mobility correction in the step (c) is used as time-series data in the step (B), and
wherein in the best area extracting step (b), distances between adjacent basic peaks are measured in each small area, and an area having a smallest dispersion in distances between adjacent basic peaks is extracted as the best area, and
wherein the steps (A)-(C) and (a)-(c) are performed by a data processing computer.
2. The method for determining a base sequence according to claim 1, wherein the dispersion in distances between adjacent basic peaks is evaluated by a ratio between a minimum distance between adjacent basic peaks Dmin and a maximum distance between adjacent basic peaks Dmax in each small area, and wherein when the ratio of Dmin/Dmax is larger, the dispersion in distances between adjacent basic peaks is smaller.
3. The method for determining a base sequence according to claim 2, wherein when it is determined that a small area contains only a repeated sequence, the evaluation of the small area is downgraded by multiplying the ratio of Dmin/Dmax of the small area by a value smaller than 1.
4. The method for determining a base sequence according to claim 1, wherein in the area dividing step (a), adjacent small areas are overlapped with each other to increase the number of areas.
5. The method for determining a base sequence according to claim 1, wherein in the mobility correction step (c), four kinds of basic peaks corresponding to four kinds of bases are overlaid together in the best area, and then while one to three kinds of basic peaks out of the four kinds of basic peaks are fixed, remaining basic peaks are shifted in temporal forward and backward directions so that a total area of a peak waveform is maximized to determine a shift amount(s) of one to three kinds of bases, and wherein the shift amount(s) of one to three kinds of bases whereby a total area of a peak waveform is maximized is(are) defined as a mobility correction amount(s) of the one to three kinds of bases.
6. The method for determining a base sequence according to claim 1, wherein the basic peak at the search starting point determined in the condition determining step (B) is a peak having a largest peak area in the best area.
7. The method for determining a base sequence according to claim 1, wherein the standard peak-to-peak distance determined in the condition determining step (B) is an average peak-to-peak distance calculated from a part of the best area where basic peaks are spaced at regular intervals.
8. The method for determining a base sequence according to claim 1, wherein the data processing computer has a basic peak extracting means, a best area extracting means, a mobility correction means, and a base sequence determining means as functions achieved by the computer, wherein the basic peak extracting means performs the basic peak extracting step, the best area extracting means performs the area dividing step and the best area extracting step, the mobility correction means performs the mobility correction step on total time-series data, and the base sequence determining means performs the condition determining step and then the base sequence determining step on the total time-series data incorporated by the mobility correction means, and wherein the data processing computer outputs the base sequence determined by the base sequence determining means.
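
As a non-limiting illustration of the area dividing step (a) and the best area extracting step (b) recited in claims 1 to 4, the following Python sketch scores each small area by the ratio Dmin/Dmax and selects the area with the largest score. It is a minimal sketch under stated assumptions and is not the claimed implementation: the function names (`divide_into_areas`, `area_score`, `best_area`), the division by peak count rather than by time, the penalty value 0.5, and the test for a "repeated sequence" (all peaks in the area belonging to the same base) are hypothetical choices made only for this example.

```python
from typing import List, Optional, Sequence, Tuple


def divide_into_areas(n_peaks: int, size: int, step: int) -> List[Tuple[int, int]]:
    """Step (a): split the peak indices into small areas of `size` peaks.

    Assumption: areas are defined by peak count. A `step` smaller than
    `size` makes adjacent areas overlap, which increases their number.
    """
    return [(s, s + size) for s in range(0, n_peaks - size + 1, step)]


def area_score(times: Sequence[float], bases: Sequence[str],
               repeat_penalty: float = 0.5) -> float:
    """Evaluate one small area by Dmin/Dmax (larger means smaller dispersion)."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    if not gaps or max(gaps) <= 0:
        return 0.0
    score = min(gaps) / max(gaps)
    # Assumption: an area "contains only a repeated sequence" when every
    # peak belongs to the same base; its score is multiplied by a value < 1.
    if len(set(bases)) == 1:
        score *= repeat_penalty
    return score


def best_area(times: Sequence[float], bases: Sequence[str],
              size: int, step: int) -> Optional[Tuple[int, int]]:
    """Step (b): return the (start, end) indices of the best-scoring area."""
    areas = divide_into_areas(len(times), size, step)
    return max(
        areas,
        key=lambda a: area_score(times[a[0]:a[1]], bases[a[0]:a[1]]),
        default=None,
    )


# Hypothetical peak times and base calls, ordered by time.
times = [0, 10, 20, 30, 45, 50, 70, 80]
bases = "ACGTACGT"
print(best_area(times, bases, size=4, step=2))  # prints (0, 4)
```

In this example the first four peaks are evenly spaced, so their Dmin/Dmax ratio is 1 and that area is selected; the later areas contain uneven gaps and score lower.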